Skip to content

Conversation

@hannesrudolph
Copy link
Collaborator

@hannesrudolph hannesrudolph commented Aug 2, 2025

Summary

Closes: #4940

This PR implements a persistent telemetry event queue system that ensures reliable delivery of telemetry events even during network outages or VS Code crashes. The system includes sophisticated multi-instance coordination to prevent duplicate processing when multiple VS Code instances are running.

Key Features

Core Queue Implementation

  • FIFO Queue Structure: Events are processed in the exact order they were generated
  • Persistence: Queue survives VS Code restarts using globalState API
  • Confirmation-based Removal: Events only removed after successful cloud transmission
  • Automatic Processing: Queue processes when new events are added and on extension startup
  • 1MB Storage Limit: Automatic cleanup of oldest events when approaching limit

Multi-Instance Support

  • Distributed Locking: Only one instance can process the queue at a time
  • Instance Identification: Each VS Code instance has a unique ID for tracking
  • Lock Expiration: 30-second timeout handles crashed instances gracefully
  • Atomic Operations: Retry logic with exponential backoff prevents data corruption
  • Multiple Modes:
    • compete (default): All instances compete for processing
    • leader: Leader election for single-instance processing
    • disabled: No coordination for backward compatibility

Diagnostic Commands

Added three new commands for monitoring and managing the queue:

  • telemetryQueueStatus: View queue status (count, size, failed events)
  • telemetryQueueProcess: Manually trigger queue processing
  • telemetryQueueClear: Clear all queued events

Implementation Details

New Components

  • TelemetryEventQueue: Core FIFO queue with retry logic
  • GlobalStateQueueStorage: Persistent storage with multi-instance locking
  • CloudQueueProcessor: Handles cloud transmission
  • QueuedTelemetryClient: Wrapper that intercepts telemetry events

Integration

  • Modified CloudService to use QueuedTelemetryClient
  • Added queue processing on extension startup in extension.ts
  • Integrated diagnostic commands in registerCommands.ts

Testing

  • Comprehensive test suite with 38 tests covering:
    • Basic queue operations
    • Storage limits and cleanup
    • Multi-instance scenarios
    • Lock acquisition and expiration
    • Concurrent operations
    • Edge cases

All tests are passing.

Benefits

  • No Lost Events: Telemetry persists through crashes and network issues
  • No Duplicates: Multi-instance coordination prevents duplicate processing
  • Ordered Delivery: FIFO processing maintains event chronology
  • Graceful Degradation: System continues generating events even if processing fails
  • Clear Debugging: Instance-specific logs help troubleshoot issues

Configuration

The system can be configured via environment variables or options:

multiInstance: {
  enabled: true,
  mode: 'compete', // or 'leader' or 'disabled'
  lockDurationMs: 30000,
  lockAcquireTimeoutMs: 10000
}

This ensures reliable telemetry delivery for better product insights and debugging capabilities.


Important

Implements a persistent telemetry event queue with multi-instance support, ensuring reliable event delivery and adding diagnostic commands for management.

  • Behavior:
    • Implements a persistent telemetry event queue in QueuedTelemetryClient with FIFO structure and persistence using globalState API.
    • Multi-instance support with distributed locking and leader election modes.
    • Events removed only after successful cloud transmission.
    • Automatic processing on event addition and extension startup.
    • 1MB storage limit with automatic cleanup of oldest events.
  • Multi-Instance Support:
    • Distributed locking ensures only one instance processes the queue at a time.
    • Unique instance IDs for tracking and 30-second lock expiration.
    • Modes: compete, leader, disabled for different coordination strategies.
  • Diagnostic Commands:
    • Adds telemetryQueueStatus, telemetryQueueProcess, and telemetryQueueClear commands for queue management.
  • Integration:
    • Modifies CloudService to use QueuedTelemetryClient.
    • Processes queued events on extension startup in extension.ts.
    • Integrates diagnostic commands in registerCommands.ts.
  • Testing:
    • 38 tests covering queue operations, storage limits, multi-instance scenarios, and edge cases.
  • Benefits:
    • Ensures no lost or duplicate events, ordered delivery, and clear debugging with instance-specific logs.

This description was created by Ellipsis for b1806e2. You can customize this summary. It will automatically update as commits are pushed.

…support

- Add FIFO queue structure that maintains event order
- Implement persistence using VS Code's globalState API
- Events only removed after successful cloud transmission
- Automatic processing when new events are added
- 1MB storage limit with automatic cleanup of oldest events
- Multi-instance coordination with distributed locking
- Prevent concurrent queue processing across VS Code instances
- Add diagnostic commands for queue monitoring
- Comprehensive test coverage (38 tests)

Ensures reliable telemetry delivery even during network outages
and prevents duplicate processing when multiple instances are running.
Copilot AI review requested due to automatic review settings August 2, 2025 00:26
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. enhancement New feature or request labels Aug 2, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a comprehensive persistent telemetry event queue system with sophisticated multi-instance coordination to ensure reliable delivery of telemetry events even during network outages or VS Code crashes. The system maintains strict FIFO processing order and prevents duplicate event processing when multiple VS Code instances are running.

  • Core Queue Implementation: FIFO queue with persistence, confirmation-based removal, and 1MB storage limit with automatic cleanup
  • Multi-Instance Coordination: Distributed locking with instance identification, lock expiration handling, and configurable modes (compete/leader/disabled)
  • Diagnostic Commands: Three new commands for monitoring queue status, manual processing, and clearing events

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/extension.ts Adds queue processing on extension startup to handle events from previous sessions
src/activate/registerCommands.ts Implements three diagnostic commands for queue management and status monitoring
packages/types/src/vscode.ts Adds command IDs for the new telemetry queue diagnostic commands
packages/cloud/src/queue/types.ts Defines comprehensive type system for queue operations and multi-instance coordination
packages/cloud/src/queue/index.ts Exports all queue-related classes and types for external consumption
packages/cloud/src/queue/TelemetryEventQueue.ts Core queue implementation with FIFO processing, retry logic, and multi-instance support
packages/cloud/src/queue/QueuedTelemetryClient.ts Telemetry client wrapper that intercepts events and queues them for reliable delivery
packages/cloud/src/queue/GlobalStateQueueStorage.ts Persistent storage implementation using VS Code globalState with atomic operations
packages/cloud/src/queue/CloudQueueProcessor.ts Handles cloud transmission of queued events with retry logic for different error types
packages/cloud/src/index.ts Updates package exports to include queue functionality
packages/cloud/src/CloudService.ts Integrates queued telemetry client and provides queue management methods
Comments suppressed due to low confidence (1)

packages/cloud/src/queue/QueuedTelemetryClient.ts:61

  • Using bracket notation to access private property multiInstanceConfig breaks encapsulation. Consider adding a public getter method to GlobalStateQueueStorage for accessing this configuration.
		const interval = this.storage["multiInstanceConfig"].lockCheckIntervalMs || 5000

}

// Get queue status
const clients = (TelemetryService.instance as any).clients || []
Copy link

Copilot AI Aug 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using type assertion as any bypasses TypeScript's type safety. Consider defining proper types or interfaces for TelemetryService to access the clients property safely.

Suggested change
const clients = (TelemetryService.instance as any).clients || []
const clients =
(TelemetryService.instance as TelemetryServiceWithClients).clients || [];

Copilot uses AI. Check for mistakes.
Comment on lines +249 to +252
const clients = (TelemetryService.instance as any).clients || []
const queuedClient = clients.find((c: any) => c instanceof QueuedTelemetryClient) as
| QueuedTelemetryClient
| undefined
Copy link

Copilot AI Aug 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code pattern is duplicated in multiple command handlers. Consider extracting a helper function getQueuedTelemetryClient() to reduce code duplication and improve maintainability.

Suggested change
const clients = (TelemetryService.instance as any).clients || []
const queuedClient = clients.find((c: any) => c instanceof QueuedTelemetryClient) as
| QueuedTelemetryClient
| undefined
const queuedClient = getQueuedTelemetryClient()

Copilot uses AI. Check for mistakes.
Comment on lines 119 to 120
multiInstanceEnabled: this.storage["multiInstanceConfig"].enabled,
multiInstanceMode: this.storage["multiInstanceConfig"].mode,
Copy link

Copilot AI Aug 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using bracket notation to access private property multiInstanceConfig breaks encapsulation. This pattern appears multiple times and should use a proper getter method instead.

Suggested change
multiInstanceEnabled: this.storage["multiInstanceConfig"].enabled,
multiInstanceMode: this.storage["multiInstanceConfig"].mode,
multiInstanceEnabled: this.storage.getMultiInstanceConfig().enabled,
multiInstanceMode: this.storage.getMultiInstanceConfig().mode,

Copilot uses AI. Check for mistakes.
Comment on lines 185 to 191
const sizeInBytes = this.calculateSize(events)
if (sizeInBytes > this.maxStorageSize) {
// Remove oldest events until we're under the limit
let removedCount = 0
while (events.length > 0 && this.calculateSize(events) > this.maxStorageSize) {
events.shift() // Remove oldest event (FIFO)
removedCount++
Copy link

Copilot AI Aug 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The calculateSize method is called in each loop iteration, which involves JSON.stringify and TextEncoder operations. Consider calculating size once and subtracting the size of removed events to improve performance.

Suggested change
const sizeInBytes = this.calculateSize(events)
if (sizeInBytes > this.maxStorageSize) {
// Remove oldest events until we're under the limit
let removedCount = 0
while (events.length > 0 && this.calculateSize(events) > this.maxStorageSize) {
events.shift() // Remove oldest event (FIFO)
removedCount++
let sizeInBytes = this.calculateSize(events)
if (sizeInBytes > this.maxStorageSize) {
// Remove oldest events until we're under the limit
let removedCount = 0
while (events.length > 0 && sizeInBytes > this.maxStorageSize) {
const removedEvent = events.shift() // Remove oldest event (FIFO)
removedCount++
if (removedEvent) {
sizeInBytes -= this.calculateEventSize(removedEvent)
}

Copilot uses AI. Check for mistakes.
clearInterval(this.lockReleaseTimer)
}

const renewalInterval = Math.max((this.options.multiInstance?.lockDurationMs || 30000) / 3, 5000)
Copy link

Copilot AI Aug 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The magic numbers /3 and 5000 should be extracted as named constants to improve code readability and maintainability. Consider defining MIN_RENEWAL_INTERVAL_MS and LOCK_RENEWAL_DIVISOR.

Suggested change
const renewalInterval = Math.max((this.options.multiInstance?.lockDurationMs || 30000) / 3, 5000)
const renewalInterval = Math.max(
(this.options.multiInstance?.lockDurationMs || 30000) / LOCK_RENEWAL_DIVISOR,
MIN_RENEWAL_INTERVAL_MS
)

Copilot uses AI. Check for mistakes.
* Set up periodic queue processing for leader mode
*/
private setupPeriodicProcessing(): void {
const interval = this.storage["multiInstanceConfig"].lockCheckIntervalMs || 5000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid using bracket notation to access the internal multi-instance config (e.g. this.storage["multiInstanceConfig"]). Consider exposing it via a typed getter for better readability and type safety.


// Try to acquire lock if multi-instance is enabled
let hasLock = false
if (isMultiInstanceEnabled && this.storage instanceof GlobalStateQueueStorage) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code uses 'instanceof GlobalStateQueueStorage' to determine multi-instance locking. Consider abstracting multi-instance behavior behind an interface to avoid direct type checks.

}

// Get queue status
const clients = (TelemetryService.instance as any).clients || []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid casting TelemetryService.instance to 'any' to access 'clients'. Consider exposing a proper API to retrieve the QueuedTelemetryClient from TelemetryService.

@daniel-lxs daniel-lxs moved this from Triage to PR [Needs Prelim Review] in Roo Code Roadmap Aug 2, 2025
Copy link
Contributor

@roomote roomote bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution! I've reviewed the changes and found several issues that need attention. The implementation of the persistent telemetry queue with multi-instance support is comprehensive, but there are some critical issues around encapsulation and potential memory leaks that should be addressed.

* Set up periodic queue processing for leader mode
*/
private setupPeriodicProcessing(): void {
const interval = this.storage["multiInstanceConfig"].lockCheckIntervalMs || 5000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using bracket notation to access private property breaks encapsulation. Consider adding a public getter method to GlobalStateQueueStorage for accessing this configuration:

...status,
instanceInfo: {
...instanceInfo,
multiInstanceEnabled: this.storage["multiInstanceConfig"].enabled,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same encapsulation issue here. This should use a proper getter method instead of accessing private properties.

console.error(`[CloudQueueProcessor] Failed to process event ${event.id}:`, errorMessage)

// Store error for debugging
event.lastError = errorMessage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutating the event parameter by setting could cause memory issues if errors accumulate. Consider returning an updated event object instead:

* Atomic compare-and-swap for lock
*/
private async compareAndSwapLock(expectedLock: QueueLock | null, newLock: QueueLock): Promise<boolean> {
// VS Code's globalState doesn't provide true CAS, so we implement optimistic locking
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment acknowledges a race condition window. Consider adding a version/generation number to locks to make the compare-and-swap truly atomic:

}

// Get queue status
const clients = (TelemetryService.instance as any).clients || []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accessing private property is fragile. Consider adding a public method to TelemetryService for finding clients by type, or handle the case where this property might not exist.

// Lifecycle

public dispose(): void {
public async dispose(): Promise<void> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VS Code's Disposable interface expects synchronous disposal, but this is now async. Consider handling the async cleanup differently, perhaps with a separate shutdown method that's called before dispose.

this.telemetryClient = new TelemetryClient(this.authService, this.settingsService)

// Configure multi-instance behavior
const multiInstanceConfig: MultiInstanceConfig = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider validating the multiInstanceConfig values (e.g., ensuring durations are positive, mode is valid) before using them.

- Add public getter method getMultiInstanceConfig() to GlobalStateQueueStorage to fix encapsulation issue
- Update QueuedTelemetryClient to use the new getter method instead of accessing private property
- Add mock for QueuedTelemetryClient in CloudService tests to fix failing unit tests
- All tests now passing on both Ubuntu and Windows platforms
- Remove debug, info, and warn logging from telemetry queue system
- Keep only error logging for actual failures
- This reduces log noise in production while still capturing important errors
- Fix ESLint warnings for unused variables
…per queue retry

- TelemetryClient now throws errors when authentication fails or session token is missing
- This allows the queue processor to properly identify failed sends and retry them later
- Fixes issue where telemetry events were being discarded when in inactive-session state
- Events will now be queued and retried when authentication is restored
@hannesrudolph
Copy link
Collaborator Author

Important Update on PR Status

I've analyzed this PR and discovered a critical architectural change that affects its viability:

The Issue

The directory where all the telemetry queue implementation resides has been removed from the repository in commit a1439c1 (merged on Aug 3, 2025). The cloud functionality has been moved to an external npm package .

What This Means

  • All the queue implementation files (, , etc.) are being added to a directory that no longer exists in the main branch
  • The PR cannot be merged in its current form as it's incompatible with the new architecture
  • The encapsulation issues mentioned in the review have been addressed, but this is now moot

Recommended Next Steps

  1. Close this PR - It cannot be merged as-is due to the architectural changes
  2. Re-implement in the npm package - The telemetry queue feature would need to be implemented in the npm package repository instead
  3. Create a new PR - Once implemented in the npm package, a new PR in this repo would be needed to update the dependency and integrate the queue functionality

I apologize for the confusion. This is an unfortunate timing issue where significant architectural changes occurred after this PR was created. The feature itself looks well-implemented, but it needs to be relocated to the appropriate repository.

@hannesrudolph hannesrudolph marked this pull request as draft August 5, 2025 18:05
@hannesrudolph hannesrudolph moved this from PR [Needs Prelim Review] to PR [Draft / In Progress] in Roo Code Roadmap Aug 5, 2025
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Aug 6, 2025
@github-project-automation github-project-automation bot moved this from PR [Draft / In Progress] to Done in Roo Code Roadmap Aug 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request PR - Draft / In Progress size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Add persistent retry queue for failed telemetry events to Roo Code Cloud

2 participants